Midterm

Author

Jazmin Hernandez

Introduction

The dataset Nutrition, Physical Activity, and Obesity, was acquired from the Centers for Disease Control and Prevention through the Youth Risk Behavior Surveillance System. In this dataset, there is information on high school students in grades 9-12 from public and private schools regarding their diet, physical activity, and weight. This data helps inform the Division of Nutrition, Physical Activity, and Obesity which in turn contributes to national and state data on these markers.

The question I will be exploring and analyzing is “Does the class pertaining to obesity/weight status differ by state and/or ethnicity?”

Methods

The data were acquired from the Youth Risk Behavior Surveillance System where surveys were given to national, state, territorial, tribal, and large urban schools from grades 9-12 in U.S. high schools. Students were randomly selected to participate based on their required classes or a specific period of the school day. I used the API pertaining to the data but had to modify the default limit to allow all 44,702 observations and 31 columns to allow for all data to show.

I assessed missing values for the key variables which included:locationdesc, geolocation, data_value, race_ethnicity, and class. I then filtered the dataset for only relevant class observations including “Obesity/Weight Status.” To better visualize and compare data, I transformed the variable “question” into a binary variable where 1 = “Percent of students in grades 9-12 who have an overweight classification” and 2 = “Percent of students in grades 9-12 who have obesity.”

For exploratory data analysis, I used ggplots and bar plots to assess individual variables and explore their distribution. From here I discovered that the questions had equal counts, NY had the highest data value counts, and the race/ethnicity that was most common in this sample was Non-Hispanic White individuals.

# Exploring individual variables 
library(ggplot2)
ggplot(data, aes(x = factor(questions_coded))) + 
  geom_bar() + 
  theme_minimal()

library(ggplot2)
ggplot(data, aes(x = factor(locationabbr))) + 
  geom_bar(width = 0.5) + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# looking at distribution of ethicity
ggplot(data, aes(x = factor(race_ethnicity))) + 
  geom_bar() + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Preliminary Results

To help with data visualization, I separated states into regions pertaining to Northeast, Midwest, South, and West. From here, I was able to create a bar plot with the average data values according to Question 1 and Question 2 stratified by region. This visual helped in determining the first part of my question if obesity/weight status differed by state. To help answer the second part of my question, I created another bar plot of average data value by race/ethnicity and question. For the creation of maps to help visualize the association, I first separated and cleaned the geolocation variable into latitude and longitude columns. I created a preliminary visual map distribution of where the surveys were administered in the U.S. high schools. I created summary statistics in tabular form of race/ethnicity by question 1 and question 2 where 2 or more races, Asian, Hispanic, Non-Hispanic Black, and Non-Hispanic White had higher mean data values for Question 1. American Indian/Alaska Native and Hawaiian/Pacific Islander had higher mean data values for question 2.

To create a final representation to help answer my question, I filtered the data to include Question 1 and Question 2 separately. I then created separate maps to visualize the distribution of ethnicity/race and mean data values of the questions separately. By doing this, I could interact with the map and compare the mean data values (in percent) of how obesity/weight status differed by location and by race/ethnicity. The average data value percentage of obesity/weight for Question 1 was of 17.62% in the following locations: Alaska, South Carolina, North Carolina, Georgia, Michigan, and Wyoming. The race/ethnicity in all these locations was American Indian/Alaska Native. For Question 2, the percentage of obesity/weight status was again 17.62% which was the highest in North Dakota and Alaska. The ethnicity were also American Indian/Alaska Native.

# Bar plot of average data value by region 
ggplot(data_with_region, aes(x = region, y = data_value_alt, fill = question)) +
  stat_summary(fun = mean, geom = "bar", position = "dodge", width = 0.7) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2, 
               position = position_dodge(0.7)) +
  theme_minimal() +
  labs(title = "Average Data Value by Region and Question",
       x = "Region",
       y = "Average Data Value") +
   theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
Warning: Removed 3090 rows containing non-finite outside the scale range
(`stat_summary()`).
Removed 3090 rows containing non-finite outside the scale range
(`stat_summary()`).

# Bar plot of average data value by ethnicity
ggplot(data_with_region, aes(x = race_ethnicity, y = data_value_alt, fill = question)) +
  stat_summary(fun = mean, geom = "bar", position = "dodge", width = 0.7) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2, 
               position = position_dodge(0.7)) +
  theme_minimal() +
  labs(title = "Average Data Value by Race and Question",
       x = "Race/Ethnicity",
       y = "Average Data Value") +
   theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
Warning: Removed 3090 rows containing non-finite outside the scale range
(`stat_summary()`).
Removed 3090 rows containing non-finite outside the scale range
(`stat_summary()`).

# Visualizing distribution of where surveys were administered 
library(leaflet)
leaflet(data_with_region) |>
  addTiles() |>  
  addMarkers(
    lng = ~longitude, 
    lat = ~latitude, 
    popup = ~as.character(locationdesc)  
  )
leaflet(data_with_region) |>
  addProviderTiles('CartoDB.Positron') |>
  addCircles(
    lat = ~latitude, 
    lng = ~longitude, 
    radius = 50000, 
    color = ~case_when(
      race_ethnicity == "2 or more races" ~ "red",
      race_ethnicity == "American Indian/Alaska Native" ~ "blue",
      race_ethnicity == "Asian" ~ "green",
      race_ethnicity == "Hawaiian/Pacific Islander" ~ "purple",
      race_ethnicity =="Hispanic" ~ "yellow",
      race_ethnicity == "Non-Hispanic Black" ~ "pink",
      race_ethnicity == "Non-Hispanic White" ~ "orange"
   
    ),
    fillOpacity = 0.5,
    popup = ~sprintf(
      "<strong>State:</strong> %s<br>
       <strong>Ethnicity:</strong> %s",
      locationdesc, race_ethnicity
    )
  ) |>
  addLegend(
    position = "bottomright",
    colors = c("red", "blue", "green","purple", "yellow", "pink", "orange"),
    labels = c("2 or more races", "American Indian/Alaska Native", "Asian", "Hawaiian/Pacific Islander", "Hispanic", "Non-Hispanic Black", "Non-Hispanic White"),
    title = "Ethnicities in each location"
  )
library(knitr)
summary_table <- data_with_region |>
  group_by(race_ethnicity, questions_coded) |>
  summarise(
    Mean = mean(data_value, na.rm = TRUE),
    Median = median(data_value, na.rm = TRUE),
    Count = n(),
    SD = sd(data_value, na.rm = TRUE),
    .groups = 'drop'  # This prevents warnings about grouped data
  )

# Create a nice table with kable
kable(summary_table, caption = "Summary Statistics by Race/Ethnicity and Question")
Summary Statistics by Race/Ethnicity and Question
race_ethnicity questions_coded Mean Median Count SD
2 or more races 1 16.192233 15.80 437 3.862444
2 or more races 2 14.881068 15.00 437 4.160720
American Indian/Alaska Native 1 17.268041 17.40 437 3.417832
American Indian/Alaska Native 2 17.978351 18.10 437 5.906797
Asian 1 11.909023 11.20 437 4.008995
Asian 2 8.405263 7.80 437 3.754942
Hawaiian/Pacific Islander 1 15.641667 16.25 436 3.927818
Hawaiian/Pacific Islander 2 18.045833 17.90 436 6.499564
Hispanic 1 17.701739 17.50 438 3.111718
Hispanic 2 16.522899 16.20 438 4.191128
Non-Hispanic Black 1 18.041404 18.00 437 2.983058
Non-Hispanic Black 2 17.114386 17.10 437 3.737133
Non-Hispanic White 1 13.556368 13.60 437 1.965989
Non-Hispanic White 2 11.366038 11.10 437 3.291120
# Create a leaflet map for mean value Question 1 by race/ethnicity
leaflet(data_question1) |>
  addTiles() |>  # Add default OpenStreetMap tiles
  addCircleMarkers(
    ~longitude,
    ~latitude,
    radius = ~Mean * 0.5,  # Scale the radius by the mean value (adjust as necessary)
    color = ~case_when(
      race_ethnicity == "2 or more races" ~ "red",
      race_ethnicity == "American Indian/Alaska Native" ~ "blue",
      race_ethnicity == "Asian" ~ "green",
      race_ethnicity == "Hawaiian/Pacific Islander" ~ "purple",
      race_ethnicity == "Hispanic" ~ "yellow",
      race_ethnicity == "Non-Hispanic Black" ~ "pink",
      race_ethnicity == "Non-Hispanic White" ~ "orange",
      
    ),
      fill = TRUE,  # Ensure fill is set to TRUE
    fillOpacity = 0.5,  # Set desired fill opacity
    stroke = TRUE,  # Ensure stroke is drawn
    weight = 1,  # Set stroke weight
    
    popup = ~sprintf(
      "<strong>locationdesc:</strong> %s<br>
      <strong>Race/Ethnicity:</strong> %s<br>
       <strong>Mean Value:</strong> %.2f",
       
  locationdesc, race_ethnicity, Mean 
    
    )
  ) |> 
  addLegend(
    position = "bottomright",
    colors = c("red", "blue", "green", "purple", "yellow", "pink", "orange"),  # Use c() to create a vector
    labels = c("2 or more races", "American Indian/Alaska Native", "Asian", "Hawaiian/Pacific Islander", "Hispanic", "Non-Hispanic Black", "Non-Hispanic White"),
    title = "Race/Ethnicity"
  )
# Create a leaflet map for mean value Question 2 by race/ethnicity
leaflet(data_question2) |>
  addTiles() |>  # Add default OpenStreetMap tiles
  addCircleMarkers(
    ~longitude,
    ~latitude,
    radius = ~Mean * 0.5,  # Scale the radius by the mean value (adjust as necessary)
    color = ~case_when(
      race_ethnicity == "2 or more races" ~ "red",
      race_ethnicity == "American Indian/Alaska Native" ~ "blue",
      race_ethnicity == "Asian" ~ "green",
      race_ethnicity == "Hawaiian/Pacific Islander" ~ "purple",
      race_ethnicity == "Hispanic" ~ "yellow",
      race_ethnicity == "Non-Hispanic Black" ~ "pink",
      race_ethnicity == "Non-Hispanic White" ~ "orange"
    ),  # Removed the trailing comma here
    fill = TRUE,  # Ensure fill is set to TRUE
    fillOpacity = 0.5,  # Set desired fill opacity
    stroke = TRUE,  # Ensure stroke is drawn
    weight = 1,  # Set stroke weight
    
    popup = ~sprintf(
      "<strong>locationdesc:</strong> %s<br>
      <strong>Race/Ethnicity:</strong> %s<br>
      <strong>Mean Value:</strong> %.2f",
      locationdesc, race_ethnicity, Mean  # Ensure Locationabbr is a valid column
    )
  ) |> 
  addLegend(
    position = "bottomright",
    colors = c("red", "blue", "green", "purple", "yellow", "pink", "orange"),  # Use c() to create a vector
    labels = c("2 or more races", "American Indian/Alaska Native", "Asian", "Hawaiian/Pacific Islander", "Hispanic", "Non-Hispanic Black", "Non-Hispanic White"),
    title = "Race/Ethnicity and mean value for Question 2"
  )

Conclusion

After exploratory data analysis and the creation of graphs, it appears that obesity/weight status is higher in the American Indian/Alaska Native population. This is true for Question 1 “Percent of students in grades 9-12 who have an overweight classification” and Question 2 “Percent of students in grades 9-12 who have obesity.” However, there is a greater distribution of students falling into Question 1 category as seen in that more states had a higher percent mean value of this distribution compare to Question 2. This is also seen in the bar plot comparing average data values by region stratified by question 1 and 2. Question 1 appears to have a higher average data value than question 2 in all regions.